Integrated System Design Magazine

Custom FPGA-Based Emulators Accelerate IC Design Verification

The regular, hierarchical structure that can be created in a custom FPGA means you can compile designs on a single workstation 5 to 20 times faster than traditional emulators can.

By Barry L. Hu

A new generation of emulators that incorporate custom field programmable gate arrays (FPGAs) optimized for emulation is beginning to have a dramatic impact on integrated circuit (IC) design and verification speed. Custom FPGAs enable new architectures that are highly scalable for increased capacity, incorporate special debugging logic, and provide higher emulation and compilation speed than emulators based on off-the-shelf FPGAs. The regular, hierarchical structure that can be created in a custom FPGA means you can compile designs on a single workstation 5 to 20 times faster than traditional emulators can. The new emulators also run faster: purpose-built silicon makes it possible to run emulation several times faster while retaining 100 percent internal visibility. A logic analyzer switching matrix built into custom FPGAs eliminates the large amount of time that conventional emulators built on generic FPGAs spend recompiling probes. These improvements can more than double debugging productivity, making it possible to fix two or three bugs per day compared to one per day today.

The astonishing power of today's integrated circuits has enabled a tremendous range of products and made those products more powerful, less expensive, and easier to use. But the same dramatic increase in power and complexity that has enabled these advancements has also turned the process of designing, verifying, and ultimately taping out these chips into a task of almost unbelievable complexity. Deep-submicron (DSM) process technology means more gates on the chip and systems-on-a-chip (SOCs) that often include intellectual property from multiple sources, along with memory and software. And, of course, all of this logic has to work together, error-free, all of the time. Today, when a design goes to the fab, it must arrive with a strong degree of confidence that all of the bugs have been discovered and fixed, that the chip will work in the context of the system, and that it delivers the required functionality and performance.

Testing designs in the intended operating environment

In-circuit emulation addresses these issues by allowing designers to quickly create a hardware model of a chip design using proprietary emulation software that maps the design onto reprogrammable circuitry. In-circuit emulators model the operation of a complex IC by partitioning the circuit into blocks and compiling the blocks onto hundreds of FPGAs that are linked together to emulate its operation. The resulting virtual silicon is a timing-correct, functional equivalent of the actual chip, which can be plugged into the system being designed and run real software. In many applications, it's impossible to determine in advance all the corner cases or worst-case scenarios and test them. In-circuit emulation avoids the need for this because it makes it possible to test the design in the intended operating environment. This approach offers a major improvement over software simulation, which is too slow to verify all of the intricacies of actual chips, especially complex ones, and doesn't model the interactions between the chip and other system components and software. Using in-circuit emulation, designers can spot bugs and correct them immediately, view the chip design running in its target environment, and start integrating the system before the IC is even built. Emulation typically reduces the time to market of new designs dramatically, and reduces development costs by eliminating the expense of silicon respins.

Up to now, in-circuit emulators have used off-the-shelf commercial FPGAs as building blocks. The obvious reason is that commercial FPGAs offer a good price-to-performance ratio: their development costs are shared among many other customers for these devices, and their relatively high volumes keep manufacturing costs low. But, in recent years, the commercial FPGA market has sharply diverged from the needs of emulation (see Figure 1).

Debugging concerns

Commercial FPGAs are providing rapid gains in chip capacity, but the I/O-to-gate ratio is dropping and compilation times are increasing. This makes perfect sense when an entire design can be incorporated onto a single chip, which is the typical use of commercial FPGAs. But when a large design is mapped into an emulator for verification, the design is partitioned across hundreds of FPGAs, and the limiting factor on capacity becomes the number of I/O pins available to interconnect the devices. Rent's rule says that the number of usable gates in a multiple-chip design is equal to (pins/1.2)^1.7. As the chart of commercial FPGAs shows, increasing the raw gate count by a factor of 200 may only increase the gates usable in emulation by a factor of five (see Table 1).
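
The relationship is easy to explore numerically. The following is a minimal sketch that takes the article's form of Rent's rule at face value; the function names and the sample pin counts are illustrative assumptions, not figures from Table 1.

```python
def usable_gates(pins: float) -> float:
    """Usable gates per FPGA in a multi-chip design, using the
    article's form of Rent's rule: (pins / 1.2) ** 1.7."""
    return (pins / 1.2) ** 1.7


def pin_growth_for_gate_growth(gate_factor: float) -> float:
    """Factor by which the pin count must grow to raise the usable
    gate count by gate_factor (the rule solved for pins)."""
    return gate_factor ** (1 / 1.7)


# Hypothetical pin counts, chosen only for illustration.
for pins in (100, 200, 400, 800):
    print(f"{pins:4d} pins -> about {usable_gates(pins):,.0f} usable gates")

# A fivefold gain in usable gates corresponds to roughly 2.6x more pins,
# regardless of how much the raw gate count on the die has grown.
print(f"pin growth behind a 5x usable-gate gain: {pin_growth_for_gate_growth(5):.2f}x")
```

Because the usable capacity depends only on the pin count in this model, extra raw gates on a device add nothing unless the package also gains pins.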

One additional problem is that implementing the debugging capabilities required in emulators into a conventional FPGA requires the construction of scan chains built from flip-flops that clock the data out of the chip into the logic analyzer. This feature consumes quite a bit of logic in a commercial FPGA, reducing the capacity of traditional emulators and driving up their cost. The number of signals that can be probed is limited and the emulator needs to be recompiled in order to change the signals that are probed, which is frequently required during debugging. Just moving a single probe takes an average of 30 minutes because it requires recompiling the FPGA containing that probe.

The most advanced emulators built on conventional FPGAs do offer a mode that provides full internal visibility, probing every signal in the circuit, but it consumes a considerable amount of the logic, typically 30 percent.

Another problem is that the overhead created by providing 100 percent internal visibility reduces the speed at which signals can be sampled to around 500 kHz. This rules out the use of this visibility mode in many applications.

Emulators normally are limited to running at about 2 times oversampling; that is, the debug sampling clock is supposed to run at about two times the system clock. With a 250 kHz system clock, which is the fastest at which emulators based on conventional FPGAs can deliver 100 percent internal visibility, many circuits act abnormally. DRAMs aren't refreshed fast enough, PCI buses don't perform as they are supposed to, and Pentium chips begin acting up.

Compiling designs for emulation

Another problem is that the time involved in compiling a design for in-circuit emulators based on commercial FPGAs may be too high to meet fast-track design schedules. Traditionally, emulation required register transfer level (RTL)-to-gate level mapping using a silicon synthesis tool that converts the RTL to a netlist of gates and flip-flops prior to each design turn. This process by itself could easily take a day. Next is emulation mapping, which breaks up the gate design into hundreds of FPGA-sized chunks. The final step is compiling each segment of the design onto an individual FPGA. A couple of years ago, a new generation of RTL-to-gate mapping tools arrived that are optimized for emulation and provide dramatic increases in speed in an integrated RTL debugging environment. These tools gain their efficiency from the fact that they were not designed to produce commercial-quality silicon but simply to perform the task at hand: getting into emulation quickly. They support both Verilog and VHDL and make it possible to begin emulating as soon as the RTL design is available. In a typical 1.1 million gate networking design, overall compile time using five workstations is reduced from 32 hours using the old approach to 5 hours with the new one.

With these improvements factored in, the majority of the time required for compilation, more than 3 hours in this example, is consumed by FPGA compilation. The reason that FPGA compilation takes so much time is that conventional FPGAs, in order to optimize gate count, have a highly irregular configuration in which, for example, there are many different types of routes running between the cells. Compiling an emulated circuit over this irregular network is a challenging task that requires a large amount of computing power. This isn't a major problem in the commercial FPGA world where most designs require one or two FPGAs and are compiled no more frequently than once or twice per week. But for emulation, designs are frequently compiled daily and involve compiling several hundred FPGAs. As a result, farms of workstations or PCs that take time and money to assemble are often used, which drives up the cost of the design process and often means that compiles can only take place during odd hours when computing resources are available.

Custom FPGA technology

Emulators based on custom FPGA technology can overcome all of these problems. The architecture of the FPGA can be designed around the needs of emulation rather than optimized for building an SOC. With custom FPGAs, for example, it becomes possible to create a regular, consistent, crossbar-connected architecture. At the very lowest level, the logic elements are much like the configurable logic blocks in a conventional FPGA. One difference is that a 6-input lookup table can be used instead of the 4-input table used in traditional FPGAs. A 6-input lookup table is more efficient in the complex task of emulation, often making it possible for a single element to perform a function that would require two in a conventional FPGA. This, in and of itself, increases the emulation capacity of the device. The individual cells are organized into uniform level zero blocks that are organized into uniform level one blocks and so on (see Figure 2).
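
To make the lookup-table comparison concrete, here is a small sketch that models a k-input lookup table as a stored truth table and builds a 6-input AND both ways. The `Lut` class and the choice of a 6-input AND are assumptions made for illustration; they do not describe the actual cell design of any emulator.

```python
from itertools import product


class Lut:
    """A k-input lookup table: one stored bit per input combination."""

    def __init__(self, k, func):
        self.k = k
        # Configuration memory: func evaluated over all 2**k input vectors,
        # with the first input as the most significant index bit.
        self.table = [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

    def __call__(self, *bits):
        if len(bits) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(bits)}")
        index = 0
        for bit in bits:
            index = (index << 1) | (bit & 1)
        return self.table[index]


def and_all(*bits):
    return int(all(bits))


# One 6-input element covers the whole function.
lut6 = Lut(6, and_all)

# With 4-input elements the same function needs two cascaded tables.
lut4_first = Lut(4, and_all)
lut4_second = Lut(4, lambda partial, e, f, unused: partial & e & f)


def and6_with_lut4s(a, b, c, d, e, f):
    # The fourth input of the second table is unused and tied to 0.
    return lut4_second(lut4_first(a, b, c, d), e, f, 0)


mismatches = sum(
    lut6(*bits) != and6_with_lut4s(*bits) for bits in product((0, 1), repeat=6)
)
print(f"LUT6 config bits: {len(lut6.table)}; LUT4 config bits: {len(lut4_first.table)} each")
print(f"mismatches across all 64 input vectors: {mismatches}")
```

The two implementations agree on every input, but the 4-input version occupies two logic elements and adds a level of logic between them.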

Equally important differences between conventional and custom FPGAs arise in the way the blocks are interconnected. In the custom FPGAs optimized for emulation used in the Mercury Plus in-circuit emulation system from Quickturn (San Jose, CA), level zero blocks are connected with a crossbar matrix to form a level one block. The structure is replicated throughout the FPGA and level one blocks are crossbar interconnected to form a uniform structure of level two blocks. Likewise, the level two structure is replicated throughout the device and connected with a crossbar matrix. This architecture is optimized for the normally time-consuming compilation task rather than squeezing in the last extra gate. Higher level blocks of the circuit being emulated fit naturally into the higher level blocks of the FPGA. Likewise, each high-level block can naturally be divided into lower level blocks based on the functionality of the circuit and connected with adjacent crossbars.

The hierarchical crossbar architecture also provides consistent and easily predictable delays throughout the device, greatly reducing the computational challenge involved in routing a circuit. This new architecture means that there is a fixed delay within any logic level, between any blocks at the same level and between blocks of different levels. This makes it easy to compute the delays between any two logic elements in the device. Little computational power needs to be devoted to performance optimization; the equal length of the crossbars means that it makes little difference where blocks are located.
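
A rough sketch shows why delay calculation becomes trivial in such a structure. It assumes each logic element can be addressed by its position in the hierarchy and that the delay depends only on the highest crossbar level a signal must cross. The addressing scheme and every delay value below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical fixed delays in nanoseconds, keyed by the highest crossbar
# a signal has to cross. None of these numbers come from the article.
DELAY_NS = {
    "within level-0 block": 1.0,
    "level-1 crossbar": 3.0,
    "level-2 crossbar": 6.0,
    "top-level crossbar": 10.0,
}


def route_delay_ns(src, dst):
    """Delay between two logic elements.

    Each element is addressed as a tuple
    (level2_block, level1_block, level0_block, cell).
    """
    if src[0] != dst[0]:
        return DELAY_NS["top-level crossbar"]    # different level-2 blocks
    if src[1] != dst[1]:
        return DELAY_NS["level-2 crossbar"]      # different level-1 blocks
    if src[2] != dst[2]:
        return DELAY_NS["level-1 crossbar"]      # different level-0 blocks
    return DELAY_NS["within level-0 block"]      # same level-0 block


a = (0, 1, 2, 5)
print(route_delay_ns(a, (0, 1, 2, 7)))  # same level-0 block
print(route_delay_ns(a, (0, 1, 3, 0)))  # same level-1 block
print(route_delay_ns(a, (0, 2, 0, 0)))  # same level-2 block
print(route_delay_ns(a, (1, 0, 0, 0)))  # different level-2 blocks
```

The delay is a table lookup on a handful of comparisons, with no path search over an irregular routing fabric, which is the property that lets the compiler skip most performance optimization.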

The net result is that compilation time can be reduced by an order of magnitude or more. For example, the time required to compile a 1.1 million gate networking design is reduced from 5 hours on 5 workstations to 1.1 hours on 3 workstations, with the result that a farm of workstations or PCs is no longer required (see Figure 3). As a general rule, million gate designs can be compiled in less than an hour. Engineers compiling large designs on only one or two workstations can usually compile designs from five to twenty times faster than with emulators based on generic FPGAs. In many situations, this makes it possible for the compile to be performed at the exact time when it's needed, rather than having to wait to run it overnight, and even allows multiple compiles to be run in a single day.
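
The compile figures quoted for the 1.1 million gate design can be compared on two bases: wall-clock time and total workstation-hours. The sketch below does that arithmetic; it assumes every workstation is busy for the entire wall-clock time, which the article does not state.

```python
# (wall-clock hours, workstations) for the 1.1 million gate networking
# design, as quoted in the article.
runs = {
    "traditional synthesis flow": (32.0, 5),
    "emulation-optimized mapping, generic FPGAs": (5.0, 5),
    "custom FPGAs": (1.1, 3),
}


def workstation_hours(hours, workstations):
    # Assumption: all workstations are fully occupied for the whole run.
    return hours * workstations


baseline_hours, baseline_ws = runs["emulation-optimized mapping, generic FPGAs"]
custom_hours, custom_ws = runs["custom FPGAs"]

wall_clock_speedup = baseline_hours / custom_hours
resource_speedup = (
    workstation_hours(baseline_hours, baseline_ws)
    / workstation_hours(custom_hours, custom_ws)
)

for name, (hours, ws) in runs.items():
    total = workstation_hours(hours, ws)
    print(f"{name}: {hours} h on {ws} workstations = {total:.1f} workstation-hours")
print(f"wall-clock speedup over generic FPGAs: {wall_clock_speedup:.1f}x")
print(f"workstation-hour speedup over generic FPGAs: {resource_speedup:.1f}x")
```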

Built-in I/O multiplexing

Another critical improvement is that custom FPGAs developed for emulation have a very high pin count relative to the number of gates. The new design uses a hardware-based two-to-one I/O multiplexer ring around a sea of configurable logic blocks that doubles the effective pin count, more than tripling the number of usable gates. These features increase the effective capacity of a system built with custom FPGAs to 20 million ASIC gates.
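
The "more than tripling" figure follows from the exponent in the Rent's rule relation quoted earlier in the article. A short check, assuming that relation holds for the custom device as well:

```python
RENT_EXPONENT = 1.7  # exponent from the article's statement of Rent's rule


def usable_gate_gain(pin_multiplier: float) -> float:
    """Gain in usable gates when the effective pin count is multiplied."""
    return pin_multiplier ** RENT_EXPONENT


# Two-to-one I/O multiplexing doubles the effective pin count.
print(f"{usable_gate_gain(2):.2f}x usable gates")  # prints "3.25x usable gates"
```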

The use of custom FPGAs in the Mercury Plus in-circuit emulation system also makes it possible to build a hardware-based logic analyzer, a feature that consumed precious resources of off-the-shelf FPGAs. It consists of a matrix of switches that makes it possible to record any signals in the circuit being emulated, including I/O.

Up to 64 signals can be multiplexed to each logic analyzer pin on the chip, although, of course, the time sequencing required to accomplish this level of multiplexing takes a toll on logic analyzer speed. The user can manage this tradeoff by selecting any number of signals to multiplex through each pin up to 64. The event detectors that are used to trigger the logic analyzer are built into the silicon, avoiding the need to route the signals outside the chip. At any point in time, emulators built with custom FPGAs can determine the current state of every signal in the entire design being tested. With the logic analyzer built into the FPGA, designers can move probes instantly, avoiding lengthy delays for compiling probes. The switch matrix can reassign signals to pins at run time, eliminating what is usually the most time-consuming step of the debugging process.

Higher emulation speed avoids problems

When you perform emulation, you have to slow down the real world to the speed of the emulator. This isn't normally a problem, but there is a limit beyond which anomalies begin to occur that make it impossible to accurately emulate a device. The use of custom FPGAs makes it possible to increase the speed of the logic analyzer to 2 MHz while still providing 100 percent visibility. This level of speed provides an accurate emulation of virtually any device, eliminating a limitation of generic FPGAs and making it practical to provide 100 percent internal visibility on every emulation. The result is a substantial improvement in debugging productivity.
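
The effect on the usable system clock can be estimated from the sampling rates given in the article. This sketch assumes the same 2 times oversampling requirement applies to both kinds of emulator; the resulting custom-FPGA clock figure is a derived estimate, not a number stated in the article.

```python
OVERSAMPLING = 2  # debug sampling clock runs at about twice the system clock


def max_system_clock_hz(sample_rate_hz, oversampling=OVERSAMPLING):
    """Fastest system clock that still leaves the required oversampling margin."""
    return sample_rate_hz / oversampling


generic_hz = max_system_clock_hz(500e3)  # full visibility on generic FPGAs
custom_hz = max_system_clock_hz(2e6)     # full visibility on custom FPGAs

print(f"generic FPGAs: {generic_hz / 1e3:.0f} kHz system clock")
print(f"custom FPGAs:  {custom_hz / 1e3:.0f} kHz system clock")
```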

The use of custom FPGAs also makes it possible to build in the capability to set, force and release any storage element such as a flip-flop or register without compiling. For example, at runtime the flip-flop or register can be set to 1 or 0, locked at 1 or 0, or released to its natural state. While these capabilities can be implemented in emulators built with off-the-shelf FPGAs, it's necessary to recompile the FPGA that contains the storage element, which takes about 30 minutes. The ability to perform these functions without this delay can significantly increase the speed of the debugging process. During debugging, designers frequently have a hunch of what is causing a problem and need a way to confirm it. The best way is usually to force signal values to change in the same way the proposed fix would. This is much faster than actually changing the design to model a proposed solution to the problem. Instead of adding a gate to turn off a signal at a certain time, you simply run the emulator to that point and force the signal to zero to see whether that really fixes the problem.
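
The set, force and release operations can be pictured with a toy model of a single storage element. This is only a behavioral sketch: the `StorageElement` class, its method names, and the rule that a set is ignored while a force is active are assumptions, not a description of the emulator's actual interface.

```python
class StorageElement:
    """Toy model of a flip-flop with run-time set, force and release."""

    def __init__(self, value=0):
        self.value = value
        self._forced = None  # None means the element follows its data input

    def set(self, value):
        """One-time override; the next clock edge may change it again."""
        if self._forced is None:  # assumption: a set is ignored while forced
            self.value = value

    def force(self, value):
        """Lock the element at value until it is released."""
        self._forced = value
        self.value = value

    def release(self):
        """Return the element to its natural, input-driven behavior."""
        self._forced = None

    def clock(self, d):
        """Apply one clock edge with data input d and return the new value."""
        self.value = d if self._forced is None else self._forced
        return self.value


# Testing a hunch: hold a signal at zero for a stretch of the run and see
# whether the misbehavior disappears, without changing the design.
ff = StorageElement()
trace = [ff.clock(1)]          # natural behavior
ff.force(0)
trace += [ff.clock(1), ff.clock(1)]  # locked at 0 despite the input
ff.release()
trace.append(ff.clock(1))      # natural behavior again
print(trace)                   # prints [1, 0, 0, 1]
```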

The fact that custom FPGA based in-circuit emulators can perform this task in a few minutes rather than a half-hour significantly increases the speed of the debugging process.

The inclusion of a complete logic analyzer with instant probe changes, event detectors, and pin multiplexing means that designers can move probes instantly, avoiding lengthy delays for recompiling probes.

Increasing debugging productivity

The usual limitations that 100 percent internal visibility imposes on sampling frequency, depth, and capacity are eliminated. Several other improvements that aren't directly related to the use of custom FPGAs are reaching emulators at the same time. New firmware shortens the probe data upload time by more than a factor of 10. Off-line debugging allows teams to share an emulator more efficiently than before, increasing usage for large teams by a factor of 3 to 10. Taken together, these improvements make it possible to find and fix two to three bugs per day compared to one bug per day in traditional emulation systems (see Figure 4).

All in all, the shift from standard FPGAs to a silicon architecture that is optimized for emulation delivers higher performance and capacity. The new generation of custom FPGA based in-circuit emulators can dramatically increase the number of design turns per day by increasing compile speed and debug productivity.

The use of custom FPGAs also increases the capacity of the new in-circuit emulators to handle the large chips of tomorrow while maintaining the high modeling accuracy of previous generation emulators. These advantages combine to create a system that delivers significantly faster turnaround time and overall designer productivity.

Barry L. Hu is a senior staff emulation engineer at Intel (Santa Clara, CA). Previously, Hu was manager of system-level emulation at C-Cube Microsystems (Milpitas, CA).

To voice an opinion on this or any other article in Integrated System Design, please e-mail your comments to sdean@cmp.com.